In this step, the percentage of missing data is calculated for each variable in the cleaned dataframe. Based on these percentages, a subset is created that includes only those variables with 80% or more of the data present. This helps filter out variables with very large amounts of missing data that could bias our regression models.
| Variable | Missing_Percentage |
|---|---|
| sex_at_birth | 0.0000000 |
| state | 0.0000000 |
| eval_type | 0.0019731 |
| has_diabetes | 0.0022708 |
| good_health | 0.0029124 |
| stroke | 0.0034016 |
| asthma_ever | 0.0039255 |
| kidney_disease | 0.0043663 |
| high_bp | 0.0044286 |
| bronchitis | 0.0047678 |
| education | 0.0053655 |
| cancer | 0.0053932 |
| arthritis | 0.0059078 |
| depression | 0.0059701 |
| michd | 0.0105810 |
| age_category | 0.0179520 |
| urb_rural | 0.0192074 |
| race | 0.0220851 |
| blind | 0.0362547 |
| height | 0.0510428 |
| smoker | 0.0532213 |
| ecigs | 0.0536574 |
| heavy_drinking | 0.0757518 |
| binge_drinking | 0.0770995 |
| covid | 0.0786803 |
| obese | 0.0935445 |
| chol_check | 0.1271200 |
| chol_meds | 0.1284284 |
| physical_activity | 0.1961147 |
| income | 0.1999040 |
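The missingness calculation above can be sketched as follows. This is a minimal illustration: the dataframe name `diabetes_df` and the 20% cutoff expressed as a proportion are assumptions based on the surrounding text.

```r
library(dplyr)
library(tidyr)

# Proportion of missing values per variable, sorted ascending
missing_pct <- diabetes_df |>
  summarise(across(everything(), ~ mean(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Missing_Percentage") |>
  arrange(Missing_Percentage)

# Keep only variables with at least 80% of values present
keep_vars <- missing_pct |>
  filter(Missing_Percentage <= 0.20) |>
  pull(Variable)

diabetes_sub <- diabetes_df |> select(all_of(keep_vars))
```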
These tests aim to determine whether variables are Missing at Random (MAR) or Missing Not at Random (MNAR), so we can decide whether to include them in a regression model.
Missingness of asthma_ever: 0.0039
Missingness of asthma_now: 0.856
Missingness of smoker: 0.0532
asthma_now has over 85% missingness, which does not bode
well for any kind of model building or data analysis. However, we can
analyze if this missingness is related to the smoker
variable, which has low missingness.
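This test can be sketched as follows. It is a minimal illustration: the `missing_indicator` column is constructed here as an assumption about how the test below was set up.

```r
library(dplyr)

# Indicator for whether asthma_now is missing (TRUE = missing)
data <- data |>
  mutate(missing_indicator = is.na(asthma_now))

# Chi-square test of independence between missingness and smoker status
chisq.test(data$missing_indicator, data$smoker)
```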
##
## Pearson's Chi-squared test
##
## data: data$missing_indicator and data$smoker
## X-squared = 372.22, df = 3, p-value < 2.2e-16
We reject the null hypothesis, as the p-value (2.30 × 10^-80) is very small, indicating that missingness in asthma_now is associated with the smoker variable and that asthma_now is probably MNAR. The results of the chi-square test show that using asthma_now in our regression is likely to bias our results.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$missing_indicator and data$urb_rural
## X-squared = 31.129, df = 1, p-value = 2.414e-08
We reject the null hypothesis, as the p-value (2.41 × 10^-8) is very small, indicating that missingness in education is associated with the urb_rural variable. Both variables have low missingness individually, but the results of the chi-square test show that using them in our regression could bias our results, or that a trend we see in one could be affected by the other.
Let’s test for income vs. urban/rural.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$missing_indicator and data$urb_rural
## X-squared = 43.056, df = 1, p-value = 5.319e-11
We reject the null hypothesis, as the p-value (5.32 × 10^-11) is very small, indicating that missingness in income is associated with the urb_rural variable. urb_rural has low missingness, and income sits just below 20% missingness, but the results of the chi-square test show that missingness in one could be related to the other. In this case we choose not to include these variables in the model, a decision supported by our EDA, which showed no significant relationship.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$missing_indicator and data$bronchitis
## X-squared = 45.436, df = 1, p-value = 1.577e-11
We reject the null hypothesis, as the p-value (1.58 × 10^-11) is very small, indicating that missingness in covid is associated with the bronchitis variable. Both variables have low missingness individually, so this result suggests that the two variables could be strongly associated, and perhaps worth considering as an interaction in our regression model.
Here we test for the significance of this interaction. Based on this result, with the p-value (3.73 × 10^-123) being very small, we can proceed with including this interaction in our regression model.
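One way to run this test (a sketch; the exact test used above is not shown, so this chi-square on the observed values of the two variables is an assumption) is:

```r
# Association between covid and bronchitis on observed (non-missing) values
chisq.test(table(data$covid, data$bronchitis))
```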
We choose to run two GLM models: one for has_diabetes, whether someone has diabetes or not, and one for diab_type, the type of diabetes that someone has.
Variables Chosen and Rationale:

- sex_at_birth: no missingness; a widely known predictor of diabetes risk (for men and women at the same BMI, men have a higher risk).
- age_category: age and type of diabetes tend to be related.
- obese: has much lower missingness than physical_activity and a similar effect on T2D risk.
- race: known to affect predisposition to diabetes.
- good_health: whether being in generally good health mitigates the risk of diabetes.
- kidney_disease: there is a bidirectional link between diabetes and kidney disease.
- stroke: diabetes is associated with risk of cardiovascular complications such as stroke.
- bronchitis: bronchitis is a comorbidity of T2D, and covid and bronchitis may interact.
- covid: between diabetes and covid, the risk of one may increase the risk of the other.
First, let us generate an initial model based on the predictive
variables identified through our literature review. Through this model,
we will assess which of the predictive variables actually have a
statistically significant association with the has_diabetes
variable in our dataset.
# Fit a GLM model
glm_model_db <- glm(
has_diabetes ~
sex_at_birth + age_category + race + obese + good_health +
kidney_disease + stroke + covid * bronchitis,
data = data,
family = binomial(link = "logit") # Logistic regression
)
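The table of significant predictors can be extracted from the fitted model with a sketch like the following (assuming the `broom` package is available; the 0.05 significance threshold is an assumption):

```r
library(broom)
library(dplyr)

glm_model_db |>
  tidy() |>
  filter(term != "(Intercept)", p.value < 0.05) |>
  arrange(p.value) |>
  select(Variable = term, Coefficient = estimate, `P-value` = p.value)
```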
Based on the GLM above, we conclude that the following variables have
a statistically significant association with the
has_diabetes variable.
Here is a table of these significant predictors, arranged by significance (smallest to largest p-value).
| Variable | Coefficient | P-value |
|---|---|---|
| age_category55-59 | 2.659 | 0.00e+00 |
| age_category60-64 | 2.820 | 0.00e+00 |
| age_category65-69 | 2.990 | 0.00e+00 |
| age_category70-74 | 3.077 | 0.00e+00 |
| age_category75-79 | 3.135 | 0.00e+00 |
| age_category80+ | 2.936 | 0.00e+00 |
| obeseYes | -1.006 | 0.00e+00 |
| good_healthYes | -1.024 | 0.00e+00 |
| kidney_diseaseYes | 0.925 | 0.00e+00 |
| age_category50-54 | 2.412 | 4.94e-277 |
| age_category45-49 | 2.083 | 7.53e-200 |
| age_category40-44 | 1.624 | 6.17e-117 |
| strokeYes | 0.432 | 1.17e-101 |
| raceWhite | -0.644 | 4.31e-70 |
| age_category35-39 | 1.244 | 7.29e-65 |
| sex_at_birthMale | 0.157 | 2.10e-50 |
| age_category30-34 | 0.766 | 1.87e-22 |
| raceMultiracial | -0.393 | 6.62e-15 |
| raceOther | -0.477 | 7.03e-12 |
| bronchitisYes | 0.110 | 5.29e-07 |
| age_category25-29 | 0.321 | 2.74e-04 |
| raceHispanic | -0.125 | 1.72e-03 |
| raceNative Hawaiian/Pacific Islander | 0.166 | 3.00e-02 |
Let us now generate a model based on the same predictive variables as before. This model will assess which of the predictive variables have a statistically significant association with the diab_type variable in our dataset.
# Fit a GLM model
glm_model_dbtype = diabetes_df |>
filter(!is.na(diab_type)) |>
mutate(diab_type = case_match(diab_type, "Type 1" ~ 0, "Type 2" ~ 1)) |>
glm(diab_type ~ sex_at_birth + age_category + race + obese + good_health +
kidney_disease + stroke + covid * bronchitis,
data = _,
family = binomial(link = "logit") # Logistic regression
)
Based on the GLM above, we conclude that the following variables have
a statistically significant association with the diab_type
variable.
Here is a table of these significant predictors, arranged by significance (smallest to largest p-value).
| Variable | Coefficient | P-value |
|---|---|---|
| age_category30-34 | 1.163 | 6.43e-05 |
| age_category35-39 | 1.682 | 2.11e-09 |
| age_category40-44 | 1.692 | 2.14e-10 |
| age_category45-49 | 2.189 | 9.18e-17 |
| age_category50-54 | 2.341 | 1.01e-19 |
| age_category55-59 | 2.455 | 4.42e-22 |
| age_category60-64 | 2.894 | 5.70e-30 |
| age_category65-69 | 2.850 | 1.22e-29 |
| age_category70-74 | 3.195 | 5.74e-36 |
| age_category75-79 | 3.304 | 2.12e-37 |
| age_category80+ | 3.355 | 5.24e-38 |
| bronchitisYes | 0.338 | 4.97e-03 |
| covidYes | 0.120 | 3.80e-02 |
| covidYes:bronchitisYes | -0.336 | 4.78e-02 |
| obeseYes | -0.948 | 1.40e-47 |
| raceBlack | -0.687 | 5.23e-03 |
| raceHispanic | -0.839 | 4.96e-04 |
| raceWhite | -0.683 | 3.50e-03 |
| sex_at_birthMale | -0.158 | 3.30e-03 |
First, let us generate test-training pairs from the diabetes data for cross-validation, using the crossv_mc function. This Monte Carlo cross-validation method generates 100 random partitions, each splitting the dataset into a training subset and a held-out test subset for evaluation.
crossval_df = diabetes_df |>
filter(!is.na(diab_type)) |>
mutate(diab_type = case_match(diab_type, "Type 1" ~ 0, "Type 2" ~ 1)) |>
crossv_mc(100) |>
mutate(
train = map(train, as_tibble),
test = map(test, as_tibble)
)
Next, let us finalize our two models, fit them to the training subsets, and extract the root-mean-squared error (RMSE) to assess the predictive accuracy of each model.
# GLM model for predicting has_diabetes
db_model = glm(db_formula, data = diabetes_df, family = binomial(link = "logit"))
# GLM model for predicting diab_type
dbtype_model = diabetes_df |>
filter(!is.na(diab_type)) |>
mutate(diab_type = case_match(diab_type, "Type 1" ~ 0, "Type 2" ~ 1)) |>
glm(dbtype_formula,
data = _,
family = binomial(link = "logit"))
Constructing a dataframe with the RMSEs for each model.
# Construct dataframe with extracted RMSE
crossval_results = crossval_df |>
mutate(
db_pred = map(train, \(x) glm(db_formula, data = x)),
dbtype_pred = map(train, \(x) glm(dbtype_formula, data = x))
) |>
mutate(
rmse_db = map2_dbl(db_pred, test, rmse),
rmse_dbtype = map2_dbl(dbtype_pred, test, rmse)
)
Finding the average RMSE for each model.
crossval_results |>
summarize(
`RMSE for has_diabetes model` = format(mean(rmse_db), scientific = TRUE, digits = 3),
`RMSE for diab_type model` = format(mean(rmse_dbtype), digits = 3)
) |>
knitr::kable()
| RMSE for has_diabetes model | RMSE for diab_type model |
|---|---|
| 2.25e-13 | 0.277 |
From the results of our cross-validation and error assessment, we can see that the second model, which predicts diab_type, appears to perform relatively well, especially considering the volume of missing data we were dealing with. This model had a mean RMSE below 0.3 across the 100 Monte Carlo validation splits we performed.
On the other hand, we can see that there are likely some issues with the first model, which predicts has_diabetes. Its mean RMSE is very low, close to 0, which is highly unrealistic. This hints at deeper issues within the model, perhaps stemming from the unusually low p-values shown by several of that model's predictor variables. Unfortunately, due to the limited timeframe and project scope, we were not able to pursue this issue further. In future work, we would dedicate more time to investigating, correcting, and refining our models.